{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AQLM inference example\n",
    "\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/streaming_example.ipynb\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
    "</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6egoxPVyckBF"
   },
   "source": [
    "**Install the `aqlm` library**\n",
    "- the only extra dependency to run AQLM models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "A584OAwRWGks"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "!pip install aqlm[gpu]==1.0.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hTfcs4lrc1x4"
   },
   "source": [
    "**Load the model as usual**\n",
    "\n",
    "Just don't forget to add:\n",
    " - `trust_remote_code=True` to pull the inference code\n",
    " - `torch_dtype=\"auto\"` to load the model in it's native dtype.\n",
    "\n",
    "The tokenizer is just a normal `Llama 2` tokenizer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nvShqlguccep"
   },
   "source": [
    "**Check that the output is what one would expect from Llama-2-7b**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM\n",
    "import transformers\n",
    "import torch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "quantized_model = AutoModelForCausalLM.from_pretrained(\n",
    "    \"ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf\", trust_remote_code=True, torch_dtype=torch.float16,\n",
    ").cuda()\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do a few forward passes to load CUDA and automatically compile the kernels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "output = quantized_model.generate(tokenizer(\"\", return_tensors=\"pt\")[\"input_ids\"].cuda(), max_new_tokens=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate output using GPU streaming."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<s> An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty-one, twenty-two, twenty-three, twenty-four, twenty-five, twenty-six, twenty-seven, twenty-eight, twenty-nine, thirty, thirty-one, thirty-two, thirty-three, thirty-four, thirty-five, thirty-six, thirty-seven, thirty-eight\n"
     ]
    }
   ],
   "source": [
    "inputs = tokenizer([\"An increasing sequence: one,\"], return_tensors=\"pt\")[\"input_ids\"].cuda()\n",
    "\n",
    "streamer = TextStreamer(tokenizer)\n",
    "_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=120)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# On CPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "quantized_model = AutoModelForCausalLM.from_pretrained(\n",
    "    \"BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf\", trust_remote_code=True, torch_dtype=torch.float32,\n",
    ").cpu()\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compile AQLM numba kernel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
      "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Compiling AQLM numba kernel with parameters: kernel_key=(8, 4096, 4096, 2)\n",
      "Compiling AQLM numba kernel with parameters: kernel_key=(8, 11008, 4096, 2)\n",
      "Compiling AQLM numba kernel with parameters: kernel_key=(8, 4096, 11008, 2)\n"
     ]
    }
   ],
   "source": [
    "output = quantized_model.generate(tokenizer(\"\", return_tensors=\"pt\")[\"input_ids\"].cpu(), max_new_tokens=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate output using CPU streaming.\n",
    "**Warning:** collabs CPU is slow, please use more powerfull CPU for comfortable generation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
      "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<s> An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty-one, twenty-two, twenty-three, twenty-four, twenty-five, twenty-six, twenty-seven, twenty-eight, twenty-nine, twenty-ten, twenty-eleven, twenty-twelve, twenty-twenty, twenty-twenty, twenty-twenty-three, twenty-twenty\n"
     ]
    }
   ],
   "source": [
    "inputs = tokenizer([\"An increasing sequence: one,\"], return_tensors=\"pt\")[\"input_ids\"].cpu()\n",
    "\n",
    "streamer = TextStreamer(tokenizer)\n",
    "_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=120)"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
